graph LR
A["Install vLLM"] --> B["Load / configure<br/>model"]
B --> C["Start OpenAI-compatible<br/>API server"]
C --> D["Query from<br/>Python / curl"]
D --> E["Deploy to<br/>production"]
style A fill:#ffce67,stroke:#333
style B fill:#ffce67,stroke:#333
style C fill:#6cc3d5,stroke:#333,color:#fff
style D fill:#6cc3d5,stroke:#333,color:#fff
style E fill:#56cc9d,stroke:#333,color:#fff
Deploying and Serving LLMs with vLLM
End-to-end guide: deploy and serve LLMs at scale with vLLM for high-throughput, low-latency inference
Keywords: vLLM, LLM serving, model deployment, inference optimization, PagedAttention, OpenAI API, batching, GPU inference, production LLM

Introduction
Serving Large Language Models (LLMs) in production requires more than just loading a model and running inference. You need high throughput, low latency, and efficient GPU memory usage to handle real-world traffic.
vLLM is an open-source library designed specifically for this purpose. It makes LLM serving:
- Fast (up to 24x higher throughput than vanilla HuggingFace Transformers serving)
- Memory-efficient (via PagedAttention)
- Production-ready (OpenAI-compatible API server)
- Easy to deploy (Docker, Kubernetes, cloud)
In this tutorial, we will walk through a complete pipeline:
- Install and configure vLLM
- Serve a model with the OpenAI-compatible API
- Query the model from Python
- Optimize for production deployment
What is vLLM?
vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. Key features include:
- PagedAttention: Efficient memory management inspired by OS virtual memory, reducing GPU memory waste
- Continuous batching: Dynamically batches incoming requests for maximum throughput
- OpenAI-compatible API: Drop-in replacement for OpenAI API endpoints
- Tensor parallelism: Distribute models across multiple GPUs
- Support for many models: Llama, Mistral, Qwen, Phi, Gemma, and more
graph TD
A["vLLM Engine"] --> B["PagedAttention<br/>Memory efficiency"]
A --> C["Continuous Batching<br/>Max throughput"]
A --> D["OpenAI-compatible API<br/>Drop-in replacement"]
A --> E["Tensor Parallelism<br/>Multi-GPU support"]
A --> F["Wide Model Support<br/>Llama, Mistral, Qwen..."]
style A fill:#56cc9d,stroke:#333,color:#fff
style B fill:#6cc3d5,stroke:#333,color:#fff
style C fill:#6cc3d5,stroke:#333,color:#fff
style D fill:#6cc3d5,stroke:#333,color:#fff
style E fill:#6cc3d5,stroke:#333,color:#fff
style F fill:#6cc3d5,stroke:#333,color:#fff
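To make the PagedAttention idea concrete, here is a toy sketch of block-based KV-cache allocation. This is purely illustrative of the concept (fixed-size blocks handed out on demand, like OS pages), not vLLM's actual implementation; the block size and class names are ours.

```python
# Toy illustration of PagedAttention's core idea: the KV cache is carved into
# fixed-size blocks, and each sequence holds only the blocks it actually
# needs, instead of one large contiguous pre-allocation per request.
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def blocks_needed(self, num_tokens: int) -> int:
        # ceil-divide: a 40-token sequence needs 3 blocks of 16 tokens
        return -(-num_tokens // BLOCK_SIZE)

    def allocate(self, num_tokens: int) -> list[int]:
        n = self.blocks_needed(num_tokens)
        if n > len(self.free_blocks):
            raise MemoryError("KV cache exhausted")
        return [self.free_blocks.pop() for _ in range(n)]

    def free(self, blocks: list[int]) -> None:
        self.free_blocks.extend(blocks)

alloc = BlockAllocator(num_blocks=1024)
seq_blocks = alloc.allocate(num_tokens=40)  # holds 3 blocks, not a max-length slab
print(len(seq_blocks))  # 3
alloc.free(seq_blocks)  # blocks return to the pool for other requests
```

Because memory is reclaimed at block granularity as soon as a request finishes, many more concurrent sequences fit on the same GPU, which is what enables continuous batching.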
Hardware Requirements
vLLM is designed for GPU inference. Minimum requirements depend on your model size:
| Model Size | Minimum GPU VRAM | Recommended GPU |
|---|---|---|
| 0.5B–3B | 4 GB | RTX 3060 / T4 |
| 7B–8B | 16 GB | RTX 4090 / A10 |
| 13B | 24 GB | A10 / A100 |
| 70B | 80 GB+ | A100 / H100 (multi-GPU) |
For CPU-only machines, consider using Ollama or llama.cpp instead.
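The minimums in the table above follow from a rough rule of thumb: weights take roughly parameters × bytes-per-parameter, plus overhead for the KV cache, activations, and CUDA context. A back-of-envelope sketch (our own illustrative heuristic, not a vLLM API):

```python
def estimate_vram_gb(num_params_b: float, bytes_per_param: float = 2.0,
                     overhead: float = 1.25) -> float:
    """Rough VRAM estimate: weights (params x bytes) plus ~25% overhead
    for KV cache, activations, and CUDA context. Illustrative only."""
    weights_gb = num_params_b * bytes_per_param  # 1B params x 2 bytes (fp16/bf16) ~ 2 GB
    return weights_gb * overhead

print(round(estimate_vram_gb(7), 1))                       # 7B in fp16: 17.5
print(round(estimate_vram_gb(7, bytes_per_param=0.5), 1))  # 7B 4-bit quantized: 4.4
```

This is why a 7B–8B model wants a 16 GB card in fp16 but can squeeze onto much smaller GPUs when quantized.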
Installation
Install vLLM
pip install vllm
Install with CUDA support (recommended)
pip install vllm[cuda]
Verify installation
import vllm
print(vllm.__version__)
Offline Inference (Batch Processing)
Use vLLM for fast batch inference without starting a server.
Basic Example
from vllm import LLM, SamplingParams
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
sampling_params = SamplingParams(
temperature=0.7,
max_tokens=256,
)
prompts = [
"Explain machine learning in simple terms.",
"What is the difference between AI and ML?",
"Write a Python function to reverse a string.",
]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
print(output.outputs[0].text)
print("---")
Chat-style Inference
from vllm import LLM, SamplingParams
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
messages = [
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "Explain vLLM in simple terms."},
]
sampling_params = SamplingParams(
temperature=0.7,
max_tokens=256,
)
outputs = llm.chat(messages=[messages], sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
Serving with OpenAI-Compatible API
vLLM provides an API server that is fully compatible with the OpenAI API format.
graph LR
A["vLLM Server<br/>(port 8000)"] --> B["OpenAI-compatible<br/>/v1/chat/completions"]
B --> C["curl"]
B --> D["Python requests"]
B --> E["OpenAI Python client"]
B --> F["Any OpenAI-compatible<br/>application"]
style A fill:#56cc9d,stroke:#333,color:#fff
style B fill:#6cc3d5,stroke:#333,color:#fff
style C fill:#f8f9fa,stroke:#333
style D fill:#f8f9fa,stroke:#333
style E fill:#f8f9fa,stroke:#333
style F fill:#f8f9fa,stroke:#333
Start the Server
vllm serve Qwen/Qwen2.5-0.5B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--api-key your-secret-key \
--served-model-name my-model \
--chat-template ./chat_template.jinja \
--gpu-memory-utilization 0.90
Key options explained:
- --served-model-name: Sets the model name exposed in the API (clients use this name in requests instead of the full HuggingFace path)
- --chat-template: Path to a Jinja2 chat template file for formatting chat messages (useful for custom or fine-tuned models)
- --gpu-memory-utilization: Fraction of GPU memory to use (0.0–1.0, default 0.9). Increase for larger models, decrease to leave room for other processes
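To make --chat-template concrete: a chat template renders the messages list into the single prompt string the model was trained on. The sketch below hand-rolls a ChatML-style format (the family Qwen models use) in plain Python purely for illustration; in practice the template lives in the Jinja2 file, and this may not match your model's exact template.

```python
def apply_chatml_template(messages: list[dict]) -> str:
    """Illustrative Python equivalent of a ChatML-style Jinja2 chat template."""
    prompt = ""
    for m in messages:
        prompt += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    # Generation prompt: the model continues from the assistant header
    return prompt + "<|im_start|>assistant\n"

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi!"},
]
print(apply_chatml_template(messages))
```

Serving a fine-tuned model with the wrong template is a common source of degraded output, which is why passing the template you trained with matters.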
Start with Custom Parameters
vllm serve Qwen/Qwen2.5-0.5B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--api-key your-secret-key \
--served-model-name my-model \
--chat-template ./chat_template.jinja \
--max-model-len 4096 \
--gpu-memory-utilization 0.90 \
--dtype auto
Verify the Server
curl http://localhost:8000/v1/models
Querying the API
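Model loading can take a while, so scripts often need to wait until the server answers before sending requests. A minimal stdlib-only polling sketch (the helper name, timings, and the your-secret-key API key are our assumptions, not part of vLLM):

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url: str, api_key: str = "your-secret-key",
                    timeout: float = 60.0, interval: float = 2.0) -> bool:
    """Poll an endpoint until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            req = urllib.request.Request(
                url, headers={"Authorization": f"Bearer {api_key}"})
            with urllib.request.urlopen(req, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet (or still loading the model); retry
        time.sleep(interval)
    return False

ready = wait_for_server("http://localhost:8000/v1/models", timeout=5)
print("ready" if ready else "not reachable yet")
```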
Using curl
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-secret-key" \
-d '{
"model": "my-model",
"messages": [
{"role": "user", "content": "What is vLLM?"}
],
"temperature": 0.7,
"max_tokens": 256
}'
Using Python (requests)
import requests
response = requests.post(
"http://localhost:8000/v1/chat/completions",
headers={"Authorization": "Bearer your-secret-key"},
json={
"model": "my-model",
"messages": [
{"role": "user", "content": "Explain PagedAttention."}
],
"temperature": 0.7,
"max_tokens": 256,
}
)
print(response.json()["choices"][0]["message"]["content"])
Using OpenAI Python Client (Recommended)
Since vLLM is OpenAI-compatible, you can use the official OpenAI client:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="your-secret-key",
)
response = client.chat.completions.create(
model="my-model",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is continuous batching?"},
],
temperature=0.7,
max_tokens=256,
)
print(response.choices[0].message.content)
Serving Custom / Fine-tuned Models
If you fine-tuned a small LLM with Unsloth and exported it to GGUF (e.g., gguf_model_small), here is how to serve it with vLLM.
vLLM natively supports GGUF files — no conversion required. See the official vLLM GGUF documentation for full details.
Note: GGUF support in vLLM is experimental and under-optimized. Currently, only single-file GGUF models are supported. If you have a multi-file GGUF model, use gguf-split to merge them first.
graph TD
A["Fine-tuned model"] --> B{"Export format?"}
B -->|"GGUF"| C["Serve GGUF directly<br/>with vLLM"]
B -->|"HF safetensors"| D["Serve HF format<br/>with vLLM"]
B -->|"LoRA adapter"| E["Serve with<br/>--enable-lora"]
C --> F["OpenAI-compatible API"]
D --> F
E --> F
style A fill:#f8f9fa,stroke:#333
style C fill:#56cc9d,stroke:#333,color:#fff
style D fill:#56cc9d,stroke:#333,color:#fff
style E fill:#56cc9d,stroke:#333,color:#fff
style F fill:#6cc3d5,stroke:#333,color:#fff
Option A: Serve a GGUF file directly
Step 1: Prepare Your GGUF Model
After fine-tuning with Unsloth and exporting to GGUF, you should have a file like:
gguf_model_small/
├── added_tokens.json
├── chat_template.jinja
├── config.json
├── generation_config.json
├── merges.txt
├── model.safetensors
├── special_tokens_map.json
├── tokenizer.json
├── tokenizer_config.json
├── unsloth.BF16.gguf
├── unsloth.Q4_K_M.gguf
└── vocab.json
Step 2: Serve with vLLM
Point vLLM directly at the GGUF file. Use --tokenizer to specify the base model’s tokenizer (recommended over the GGUF-embedded tokenizer for stability):
vllm serve ./gguf_model_small/unsloth.Q4_K_M.gguf \
--tokenizer ./gguf_model_small \
--served-model-name my-finetuned-model \
--chat-template ./chat_template.jinja \
--host 0.0.0.0 \
--port 8000 \
--api-key your-secret-key \
--gpu-memory-utilization 0.90 \
--max-model-len 2048
You can also load GGUF models from HuggingFace using the repo_id:quant_type format:
vllm serve unsloth/Qwen3-0.6B-GGUF:Q4_K_M \
--tokenizer Qwen/Qwen3-0.6B \
--served-model-name qwen3-gguf \
--host 0.0.0.0 \
--port 8000 \
--api-key your-secret-key \
--gpu-memory-utilization 0.90
Add --tensor-parallel-size 2 to distribute across multiple GPUs:
vllm serve unsloth/Qwen3-0.6B-GGUF:Q4_K_M \
--tokenizer Qwen/Qwen3-0.6B \
--tensor-parallel-size 2 \
--api-key your-secret-key
Step 3: Verify and Query
curl http://localhost:8000/v1/models \
-H "Authorization: Bearer your-secret-key"
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="your-secret-key",
)
response = client.chat.completions.create(
model="my-finetuned-model",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is fine-tuning?"},
],
temperature=0.7,
max_tokens=256,
)
print(response.choices[0].message.content)
Option B: Serve in Hugging Face format (safetensors)
If you prefer maximum compatibility (e.g., with LoRA adapters or features not yet supported with GGUF), export in HF format instead:
# During fine-tuning with Unsloth, save in HF format
model.save_pretrained_merged("hf_model_small", tokenizer)
Then serve:
vllm serve ./hf_model_small \
--host 0.0.0.0 \
--port 8000 \
--api-key your-secret-key \
--served-model-name my-finetuned-model \
--chat-template ./chat_template.jinja \
--gpu-memory-utilization 0.90 \
--dtype auto \
--max-model-len 2048
Serve a LoRA Adapter (Without Merging)
If you prefer to keep LoRA weights separate, vLLM supports serving LoRA adapters on top of a base model:
vllm serve Qwen/Qwen2.5-0.5B-Instruct \
--enable-lora \
--lora-modules my-lora=./lora_model \
--host 0.0.0.0 \
--port 8000 \
--api-key your-secret-key
Then query with the LoRA model name:
response = client.chat.completions.create(
model="my-lora",
messages=[{"role": "user", "content": "Hello!"}],
)
Docker Deployment
Deploy vLLM in a container for production environments.
graph LR
A["vLLM Docker Image<br/>(vllm/vllm-openai)"] --> B["GPU Container<br/>(--gpus all)"]
B --> C["Model loaded<br/>in container"]
C --> D["Expose port 8000"]
D --> E["Production traffic"]
style A fill:#ffce67,stroke:#333
style B fill:#6cc3d5,stroke:#333,color:#fff
style C fill:#6cc3d5,stroke:#333,color:#fff
style D fill:#56cc9d,stroke:#333,color:#fff
style E fill:#56cc9d,stroke:#333,color:#fff
Dockerfile
FROM vllm/vllm-openai:latest
# Note: exec-form CMD does not expand environment variables, so pass the model name directly
CMD ["--model", "Qwen/Qwen2.5-0.5B-Instruct", "--host", "0.0.0.0", "--port", "8000"]
Run with Docker
docker run --gpus all \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model Qwen/Qwen2.5-0.5B-Instruct \
--host 0.0.0.0 \
--port 8000
Performance Optimization Tips
- GPU memory utilization: Set --gpu-memory-utilization 0.90 to maximize GPU usage (range: 0.0–1.0)
- Served model name: Use --served-model-name for cleaner API model names instead of long HuggingFace paths
- Chat template: Use --chat-template to apply a custom Jinja2 chat template for fine-tuned models
- Quantization: Use AWQ or GPTQ quantized models to reduce VRAM
- Tensor parallelism: Use --tensor-parallel-size N for multi-GPU setups
- Max model length: Reduce --max-model-len if you don't need long contexts
- Continuous batching: Enabled by default, handles concurrent requests efficiently
- Streaming: Use stream=True for real-time token generation
vLLM vs Other Serving Solutions
| Feature | vLLM | Ollama | TGI | llama.cpp |
|---|---|---|---|---|
| Throughput | Very High | Medium | High | Low-Medium |
| GPU Required | Yes | Optional | Yes | Optional |
| OpenAI API | Yes | Partial | Yes | Partial |
| Multi-GPU | Yes | No | Yes | No |
| Ease of Use | Medium | Easy | Medium | Medium |
| Best For | Production | Local Dev | Production | Edge/CPU |
Conclusion
vLLM is the go-to solution for high-performance LLM serving in production:
- Serves models with an OpenAI-compatible API
- Handles high-concurrency with continuous batching
- Optimizes GPU memory with PagedAttention
- Supports custom and fine-tuned models
- Deploys easily with Docker and Kubernetes
This workflow is perfect for:
- Production AI APIs
- Enterprise LLM platforms
- High-traffic chatbot backends
- Multi-model serving infrastructure
Read More
- Combine with a RAG pipeline (LangChain + vLLM)
- Add load balancing with Nginx or Traefik
- Deploy on Kubernetes with GPU node pools
- Monitor with Prometheus + Grafana
- Serve multiple models with model routing